Data Visualization Techniques

Venustiano Soancatl Aguilar

Content

  • The grammar of graphics
  • The major components of layers
  • Hands on practice
  • Visualizations based on the gg approach

The grammar of graphics


The grammar of graphics is a set of grammatical rules for creating perceivable graphs, or what we call graphics (Leland Wilkinson, 2005).


By analogy with language, good grammar is just the first step in creating a good sentence.

An Object-Oriented Graphics System

  1. Specification
    1. DATA : a set of data operations that create variables from datasets,
    2. TRANS : variable transformations (e.g., rank),
    3. SCALE : scale transformations (e.g., log),
    4. COORD : a coordinate system (e.g., polar),
    5. ELEMENT : graphs (e.g., points) and their aesthetic attributes (e.g., color),
    6. GUIDE : one or more guides (axes, legends, etc.).
  2. Assembly
  3. Display
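As a rough illustration of the TRANS and SCALE steps in the specification above, the following sketch applies a rank transformation and a log scale to a small made-up variable. The function names `rank_transform` and `log_scale` are hypothetical; they are not part of Wilkinson's system or of any library.

```python
import math

def rank_transform(values):
    """TRANS: replace each value by its rank (1 = smallest); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def log_scale(values):
    """SCALE: map positive values to log10 space, then rescale to [0, 1].

    Assumes at least two distinct positive values.
    """
    logs = [math.log10(v) for v in values]
    lo, hi = min(logs), max(logs)
    return [(x - lo) / (hi - lo) for x in logs]

values = [1, 1000, 10, 100]        # hypothetical data
ranks = rank_transform(values)     # position of each value in sorted order
positions = log_scale(values)      # evenly spaced after the log scale
print(ranks)
print(positions)
```

The transformation changes the variable itself, while the scale only changes how the variable is laid out along a dimension.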

Graphics Pipeline

  • Algebra comprises the operations that allow us to combine variables and specify dimensions of graphs.
  • Scales involve the representation of variables on measured dimensions.
  • Statistics covers the functions that allow graphs to change their appearance and representation schemes.
  • Geometry covers the creation of geometric graphs from variables.

A layered grammar of graphics

Layers of the grammar of graphics


A layer is composed of

  1. data and aesthetic mappings
  2. a geometric object
  3. a statistical transformation
  4. a position adjustment
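One way to picture the four components of a layer is as fields of a small record. The sketch below is only an illustration in Python: the `Layer` and `Plot` classes are hypothetical and do not reproduce ggplot2's internals. The two data rows are taken from the blue jay table shown later in this deck.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    data: list                    # rows, here a list of dicts
    mapping: dict                 # aesthetic -> column name
    geom: str = "point"           # geometric object
    stat: str = "identity"        # statistical transformation
    position: str = "identity"    # position adjustment

@dataclass
class Plot:
    layers: list = field(default_factory=list)

    def __add__(self, layer):
        # Return a new plot with the layer appended, mimicking ggplot2's `+`.
        return Plot(self.layers + [layer])

rows = [
    {"Mass": 73.30, "Head": 56.58, "KnownSex": "M"},
    {"Mass": 65.50, "Head": 53.77, "KnownSex": "F"},
]
mapping = {"x": "Mass", "y": "Head", "color": "KnownSex"}

plot = (
    Plot()
    + Layer(rows, mapping, geom="point")
    + Layer(rows, mapping, geom="density_2d", stat="density_2d")
)
for layer in plot.layers:
    print(layer.geom, layer.stat, layer.position)
```

Both layers share the same data and mapping but differ in geometry and statistic, which is the situation in the contour plots later on.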

1. Data and aesthetic mapping

Aesthetic mappings

2. Geometric objects

Geometric object

A sample of geometric objects

Graphical primitives

  • geom_path()
  • geom_rect()
  • geom_polygon()

One variable

  • Discrete
    • geom_bar()
  • Continuous
    • geom_histogram()
    • geom_density()

Two variables

  • Both continuous
    • geom_smooth()
    • geom_point()
  • At least one discrete
    • geom_count()
    • geom_jitter()
  • One continuous, one discrete
    • geom_boxplot()
    • geom_violin()

Three variables

  • geom_contour()
  • geom_tile()
  • geom_raster()

Aesthetic mapping in practice

library(dviz.supp)
library(forcats)
library(lubridate)

if (!requireNamespace("gt")) install.packages("gt")
library(gt)

Daily temperature data

station_id   month  day  temperature  flag  date     location
USC00042319  01     1    51.0         S     0-01-01  Death Valley
USC00042319  01     2    51.2         S     0-01-02  Death Valley
USC00042319  01     3    51.3         S     0-01-03  Death Valley
USC00042319  01     4    51.4         S     0-01-04  Death Valley
USC00042319  01     5    51.6         S     0-01-05  Death Valley
USC00042319  01     6    51.7         S     0-01-06  Death Valley

Mapping and geometry

p <- ggplot(temps_long, 
            aes(x = date, 
                y = temperature, 
                color = location)
            ) +
  geom_line(linewidth = 1) +
  scale_x_date(name = "month", 
               limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
               breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
                          ymd("0000-10-01"), ymd("0001-01-01")),
               labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) + 
  scale_y_continuous(limits = c(19.9, 107),
                     breaks = seq(20, 100, by = 20),
                     name = "temperature (°F)") +
  scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
  theme_dviz_grid() +
  theme(legend.title = element_text(hjust = 0.5))

Temperature plot

Seaborn and the Grammar of Graphics

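# Imports needed by this snippet
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# `lf` (long-format temperature data) and `palette_map` (location -> color)
# are assumed to be defined earlier; they are not shown on this slide.

# Create plot
fig, ax = plt.subplots(figsize=(9, 5))

# Use seaborn lineplot; pass palette by mapping
sns.lineplot(
    data=lf,
    x='date',
    y='temperature',
    hue='location',
    palette=palette_map,
    linewidth=1.5,  # similar to geom_line linewidth
    ax=ax
)

# X-axis limits (use valid years 2000-01-01 to 2001-01-04); tick breaks are not shown here
xmin = pd.to_datetime("2000-01-01")
xmax = pd.to_datetime("2001-01-04")
ax.set_xlim(xmin, xmax)

# Y-axis limits, matching the ggplot2 version
ax.set_ylim(19.9, 107)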

Temperature plot using Seaborn


Changing the geometry to heatmap

Preprocessing:

  • Compute mean by location & month
  • Replace month numbers with names

Mean temperature per month

location      month  mean
Death Valley  Jan    53.45161
Death Valley  Feb    59.94483
Death Valley  Mar    68.44839
Death Valley  Apr    76.29333
Death Valley  May    86.60645
Death Valley  Jun    95.54667
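The two preprocessing steps can be sketched in pandas, in line with the Seaborn example earlier in this deck. This is an assumption about how one might do it, not the code used to produce the table above. The column names follow the daily temperature table; the six January values come from that table, while the two February rows are made up, so the means printed here differ from the full-month means.

```python
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Six Death Valley rows from the daily table plus two made-up February rows
temps = pd.DataFrame({
    "location": ["Death Valley"] * 8,
    "month": ["01"] * 6 + ["02"] * 2,
    "temperature": [51.0, 51.2, 51.3, 51.4, 51.6, 51.7, 59.0, 61.0],
})

# Step 1: compute the mean by location and month
mean_temps = (
    temps
    .assign(month=temps["month"].astype(int))
    .groupby(["location", "month"], as_index=False)["temperature"]
    .mean()
    .rename(columns={"temperature": "mean"})
)

# Step 2: replace month numbers with names, kept in calendar order
mean_temps["month"] = pd.Categorical(
    mean_temps["month"].map(dict(enumerate(MONTHS, start=1))),
    categories=MONTHS,
    ordered=True,
)
print(mean_temps)
```

Using an ordered categorical keeps the months in calendar order on the axis instead of alphabetical order.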

Aesthetic mapping and geometry

p <- ggplot(mean_temps, 
            aes(x = month, y = location, fill = mean)) + 
     geom_tile(width = .95, height = 0.95) +
     scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98,
                       name = "temperature (°F)") + 
     scale_y_discrete(name = NULL) +
     ...

3. Statistical transformations

Common statistical transformations


ggplot2 stat_ functions
Table adapted from Hadley Wickham (2016).

Name      Description
bin       Divide continuous range into bins, and count number of points in each
boxplot   Compute statistics necessary for boxplot
contour   Calculate contour lines
density   Compute 1d density estimate
identity  Identity transformation, f(x) = x
jitter    Jitter values by adding small random value
qq        Calculate values for quantile-quantile plot
quantile  Quantile regression
smooth    Smoothed conditional mean of y given x
summary   Aggregate values of y for given x
unique    Remove duplicated observations
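To make two of the table entries concrete, here is a minimal pure-Python sketch of what `bin` and `summary` compute. The functions `stat_bin` and `stat_summary` below are simplified stand-ins written for this illustration, not ggplot2's implementations. The body-mass values and sexes are the six rows of the blue jay table in the Contours part of this deck.

```python
def stat_bin(values, bins=5):
    """Divide the range of `values` into equal-width bins and count points in each.

    Assumes at least two distinct values, so the bin width is not zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum falls on the upper edge; clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

def stat_summary(xs, ys, fun=lambda group: sum(group) / len(group)):
    """Aggregate the y values that share the same x (mean by default)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return {x: fun(group) for x, group in groups.items()}

mass = [73.30, 75.10, 70.25, 65.50, 74.90, 63.90]   # body mass (g)
sex = ["M", "M", "M", "F", "M", "F"]                 # KnownSex

edges, counts = stat_bin(mass, bins=3)
mean_mass = stat_summary(sex, mass)
print(counts)
print(mean_mass)
```

A histogram geometry would draw one bar per entry in `counts`; a summary layer would draw one point or bar per key in `mean_mass`.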

Contours

Relationship between body mass and head length in blue jays.


Blue jay dataset

BirdID      KnownSex  BillDepth  BillWidth  BillLength  Head   Mass   Skull  Sex
0000-00000  M         8.26       9.21       25.92       56.58  73.30  30.66  1
1142-05901  M         8.54       8.76       24.99       56.36  75.10  31.38  1
1142-05905  M         8.39       8.78       26.07       57.32  70.25  31.25  1
1142-05907  F         7.78       9.30       23.48       53.77  65.50  30.29  0
1142-05909  M         8.71       9.84       25.47       57.32  74.90  31.85  1
1142-05911  F         7.28       9.30       22.25       52.25  63.90  30.00  0

Contour plot, first version

blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) + 
  scale_x_continuous(limits = c(57, 82), expand = c(0, 0), name = "body mass (g)") +
  scale_y_continuous(limits = c(49, 61), expand = c(0, 0), name = "head length (mm)" ) +
  theme_dviz_grid()

# Contour lines on top of the raw points
blue_jays_base + 
  stat_density_2d(color = "black", linewidth = 0.4, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = 1/3)

# Shading: fill the contour bands by density level
blue_jays_base + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", color = "black", linewidth = 0.15, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = .4) +
  scale_fill_gradient(low = "grey95", high = "grey70", guide = "none")
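The statistic behind these contour plots is a two-dimensional kernel density estimate evaluated on a grid; contour lines are then drawn at evenly spaced density levels. The NumPy sketch below illustrates the idea under stated assumptions: `kde_2d` is a hypothetical helper, the bandwidths are picked by hand (ggplot2 chooses them automatically), and only the six blue jay rows shown above are used, so the numbers will not match the plots.

```python
import numpy as np

def kde_2d(x, y, grid_x, grid_y, bandwidth):
    """Evaluate a 2-D Gaussian kernel density estimate on a regular grid."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    hx, hy = bandwidth
    gx, gy = np.meshgrid(grid_x, grid_y)     # shape (len(grid_y), len(grid_x))
    # Standardized distance from every grid node to every observation
    zx = (gx[..., None] - x) / hx
    zy = (gy[..., None] - y) / hy
    kernel = np.exp(-0.5 * (zx**2 + zy**2)) / (2 * np.pi * hx * hy)
    return kernel.mean(axis=-1)              # average over observations

mass = [73.30, 75.10, 70.25, 65.50, 74.90, 63.90]   # body mass (g)
head = [56.58, 56.36, 57.32, 53.77, 57.32, 52.25]   # head length (mm)

# Same axis limits as blue_jays_base
grid_x = np.linspace(57, 82, 101)
grid_y = np.linspace(49, 61, 101)
density = kde_2d(mass, head, grid_x, grid_y, bandwidth=(3.0, 1.5))  # assumed bandwidths

# Contour levels spaced like binwidth = 0.004 in the R code
levels = np.arange(0.004, density.max(), 0.004)
print(density.shape, levels)
```

The `density` array and `levels` could be handed to a contour-drawing routine such as matplotlib's `ax.contour(grid_x, grid_y, density, levels=levels)`.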

Grouping by sex

blue_jays_base + 
  aes(color = KnownSex) +
  stat_density_2d(linewidth = 0.4, binwidth = 0.006) +
  geom_point(size = 1.5, alpha = 0.7) +
  ...

Bins

Common applications:

  • Histograms
  • Contours
  • Heatmaps: values aggregated into grid cells to display intensity across two dimensions
  • Temporal aggregation
  • Large-data intensity approximation

Mass spectrometry

Prompt: Given a pandas DataFrame with more than 200 million rows and an 'mz' column with more than 26 million unique values, how can the table be aggregated so that we can create a heat map with 'mz' on the vertical axis, time on the horizontal axis, and intensity on the 'z' axis (color)?
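One possible answer is fixed-width binning: round 'mz' and time down to the start of their bin, sum the intensity per (mz bin, time bin) pair, and pivot the result into a grid. The sketch below is a minimal version under stated assumptions: the column names `time` and `intensity`, the bin widths, and the helper names `bin_intensity` and `heatmap_grid` are all hypothetical, and the data are made up. Processing the table in chunks keeps memory bounded, since each chunk is reduced to its bin sums before being combined.

```python
import numpy as np
import pandas as pd

def bin_intensity(df, mz_width=1.0, time_width=1.0):
    """Return summed intensity per (mz_bin, time_bin) in long format."""
    # Label each row with the lower edge of its bin
    mz_bin = (np.floor(df["mz"] / mz_width) * mz_width).rename("mz_bin")
    time_bin = (np.floor(df["time"] / time_width) * time_width).rename("time_bin")
    return df["intensity"].groupby([mz_bin, time_bin]).sum()

def heatmap_grid(chunks, **kwargs):
    """Combine per-chunk bin sums into one grid (rows: mz bins, columns: time bins)."""
    parts = [bin_intensity(chunk, **kwargs) for chunk in chunks]
    combined = pd.concat(parts).groupby(level=["mz_bin", "time_bin"]).sum()
    return combined.unstack(fill_value=0.0)

# Tiny made-up example, split into two chunks
df = pd.DataFrame({
    "mz": [100.02, 100.47, 101.30, 100.91, 101.75],
    "time": [0.2, 0.7, 0.4, 1.1, 1.9],
    "intensity": [10.0, 5.0, 7.0, 3.0, 2.0],
})
chunks = [df.iloc[:3], df.iloc[3:]]
grid = heatmap_grid(chunks, mz_width=1.0, time_width=1.0)
print(grid)
```

The resulting `grid` has one row per 'mz' bin and one column per time bin, so its size depends on the chosen bin widths rather than on the 26 million unique values. It can be passed to a heat map routine such as `sns.heatmap(grid)`; for a file on disk, the chunks could come from `pd.read_csv(..., chunksize=...)`.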

4. Position adjustment